Abstract
In the rapidly evolving digital education ecosystem, virtual classrooms require interactive and inclusive technologies to enhance teaching effectiveness. This paper presents Virtual Air Drawing using Computer Vision and Real-Time Captions, a system that enables instructors to draw in mid-air using hand gestures while simultaneously generating live captions from speech. The proposed approach uses computer vision–based hand tracking with MediaPipe to capture gestures accurately and render them onto a virtual canvas in real time. In addition, speech-to-text technology converts spoken content into captions, improving accessibility for learners with hearing impairments or language barriers. By integrating visual interaction with real-time textual support, the system enhances engagement, clarity, and inclusivity in online learning environments. The proposed solution is well suited for virtual classrooms, collaborative learning platforms, and interactive e-learning applications.
Introduction
This paper presents a Virtual Air Drawing with Computer Vision and Real-Time Captioning system designed to enhance interactivity and accessibility in online education. Traditional virtual teaching tools rely on static inputs such as keyboards and slides, which limit engagement. To address this, the proposed system enables instructors to draw in the air using hand gestures captured via a webcam and processed with computer vision, providing a more natural and intuitive teaching experience.
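To make the capture-and-draw pipeline concrete, the following minimal Python sketch tracks one hand with MediaPipe and traces the index fingertip (landmark 8 in MediaPipe's hand model) onto an OpenCV overlay canvas. The webcam index, stroke color, and key bindings are illustrative assumptions rather than details taken from the paper.

```python
# Minimal air-drawing loop sketch: MediaPipe tracks the hand and the index
# fingertip is traced onto an overlay canvas. Assumes the default webcam;
# press 'q' to quit, 'c' to clear the canvas.
import cv2
import mediapipe as mp
import numpy as np

mp_hands = mp.solutions.hands
cap = cv2.VideoCapture(0)
canvas = None
prev_point = None

with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.7) as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.flip(frame, 1)                 # mirror for natural drawing
        if canvas is None:
            canvas = np.zeros_like(frame)
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.multi_hand_landmarks:
            lm = results.multi_hand_landmarks[0].landmark[8]  # index fingertip
            h, w, _ = frame.shape
            point = (int(lm.x * w), int(lm.y * h))
            if prev_point is not None:
                cv2.line(canvas, prev_point, point, (0, 255, 0), 4)
            prev_point = point
        else:
            prev_point = None                      # pen up when the hand is lost
        cv2.imshow("Air Drawing", cv2.add(frame, canvas))
        key = cv2.waitKey(1) & 0xFF
        if key == ord('q'):
            break
        if key == ord('c'):
            canvas = np.zeros_like(frame)

cap.release()
cv2.destroyAllWindows()
```

In this sketch, lifting the pen is approximated by losing hand detection; a full system such as the one described here would instead use a dedicated pen-up gesture recognized by the gesture module.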
The system integrates gesture-based air drawing with live speech-to-text captioning, improving inclusivity for learners with hearing impairments or language barriers. MediaPipe is used for accurate hand tracking and gesture recognition, while speech-to-text technology generates synchronized real-time captions. By combining visual drawings and textual information, the system supports diverse learning styles and improves concept explanation in virtual classrooms.
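The captioning side can run concurrently with the drawing loop on a background thread. The sketch below uses the SpeechRecognition package's Google Web Speech backend as a stand-in: the paper does not name its exact speech-to-text engine, so the recognizer choice, the 30-second run time, and the caption formatting are all assumptions.

```python
# Live-captioning sketch: a background listener transcribes each captured
# phrase and prints it as a caption while the main thread does other work.
import time

import speech_recognition as sr

recognizer = sr.Recognizer()

def on_speech(recog, audio):
    # Called from a background thread whenever a phrase is captured.
    try:
        caption = recog.recognize_google(audio)   # requires internet access
        print(f"[CAPTION] {caption}")
    except sr.UnknownValueError:
        pass                                      # speech was unintelligible
    except sr.RequestError as exc:
        print(f"[CAPTION] STT request failed: {exc}")

# Calibrate the energy threshold against ambient room noise first.
with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)

# listen_in_background returns a stop function; captions keep printing
# while the drawing loop (or any other work) runs in the main thread.
stop_listening = recognizer.listen_in_background(sr.Microphone(), on_speech)
time.sleep(30)                # caption for 30 s in this standalone sketch
stop_listening(wait_for_stop=False)
```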
The literature survey highlights that while earlier air-drawing methods suffered from lighting and background sensitivity, modern deep-learning-based frameworks like MediaPipe have significantly improved accuracy. However, most existing systems lack accessibility features such as live captioning. This work fills that gap by unifying air drawing and real-time captions into a single framework.
The proposed system architecture includes modules for video capture, hand tracking, gesture recognition, drawing rendering, caption generation, preprocessing, and visualization. Gesture recognition is performed using a CNN–LSTM model, while caption generation uses a Seq2Seq model with attention, enabling context-aware descriptions of drawings.
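As a rough illustration of the gesture-recognition module, the following Keras sketch stacks a per-frame CNN feature extractor under an LSTM, matching the CNN–LSTM arrangement described above. Clip length, frame size, layer widths, and the number of gesture classes (NUM_CLASSES) are illustrative assumptions, and the Seq2Seq-with-attention caption model is omitted for brevity.

```python
# CNN–LSTM gesture classifier sketch: a CNN extracts features from each
# frame of a short clip, and an LSTM aggregates them into one prediction.
from tensorflow.keras import layers, models

SEQ_LEN, H, W, C = 16, 64, 64, 3  # assumed: 16-frame clips of 64x64 RGB hand crops
NUM_CLASSES = 5                   # assumed: e.g. draw / erase / pen-up / clear / select

model = models.Sequential([
    # Per-frame CNN feature extractor, applied across the time axis.
    layers.TimeDistributed(layers.Conv2D(32, 3, activation="relu"),
                           input_shape=(SEQ_LEN, H, W, C)),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Conv2D(64, 3, activation="relu")),
    layers.TimeDistributed(layers.MaxPooling2D()),
    layers.TimeDistributed(layers.Flatten()),
    # LSTM aggregates the per-frame features over time.
    layers.LSTM(128),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```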
Performance evaluation shows high accuracy for gesture recognition (up to 98% under optimal lighting with high-resolution cameras) and strong caption accuracy (up to 92% in quiet environments). Although performance slightly decreases in low light and noisy conditions, the system remains stable and responsive in real-time use.
Conclusion
The Virtual Air Drawing system successfully combines hand gesture recognition and real-time captioning, providing accurate, intuitive, and responsive drawing. High-resolution cameras and optional mouse-assisted controls enhanced precision and minimized errors. Real-time captions improved accessibility and user experience, making the system suitable for digital art, education, and assistive applications.